1. Introduction to radiomics
2. Image display
3. Radiomic features
4. Segmentation
4.1. Segmentation with lungct
4.2. Segmentation with Python (I)
4.3. Segmentation with Python (II)
4.4. Segmentation with lungmask
5. Radiomic features
5.1. Radiomic features with RIA
5.2. Radiomic features with pyradiomics
6. RaDAr object
6.1. Example creating RaDAr object
6.2. RaDAr object with L1
7. Data analysis
7.1. Clinical data
7.2. Principal Component Analysis (PCA)
7.3. Canonical correlation analysis (CCA)
7.4. COPD vs controls classification
7.5. Classification of COPD types
Medical images are essential for studying diseases. For example, to examine how a patient's tumor evolves, we perform multiple scans over time. Interpreting how the tumor has changed by looking at the images is not an easy task, and different doctors may predict different outcomes. This is where computers become very useful. Thanks to artificial intelligence techniques, we can show computers a patient's medical images taken at different points in time, and they can identify features of those images to make predictions. They can help decide whether a treatment is working, suggest an alternative treatment, and much more, without having to wait for the tumor to get worse.
In recent decades, many areas of human activity have become increasingly digitized. In medicine, a significant amount of the information generated during clinical routine is now stored digitally, and with this growth new and better software has been developed to analyze the data. Thanks to research in Artificial Intelligence, these methods have become very powerful and available to any user, allowing doctors to use them on a daily basis.
As the amount of data increases, Artificial Intelligence techniques (mainly Machine Learning and Deep Learning) become highly useful for handling it, an area known as "Big Data". In simple terms, Big Data refers to data sets whose size, complexity, and speed of growth make them difficult to analyze with traditional tools.
Radiomics is a field of medical study whose purpose is to extract a large volume of features from medical images using data characterization algorithms. These features are known as radiomic features and can help to discover tumor patterns that are difficult for the human eye to detect. Radiomics is a relatively new scientific field: the first radiomics studies appeared in PubMed as recently as 2011.
It is expected that radiomics will ultimately support patient-specific treatment: it will help doctors select the most appropriate treatment for each patient and change treatments quickly if they do not work.
Some of the difficulties in radiomics research are that high-quality images of adequate size and complete datasets are needed. Separate training and validation datasets are also necessary to check whether an algorithm works correctly. Other difficulties are class imbalance (a classification problem where the number of observations per class is not equally distributed) and overfitting (when a statistical model fits its training data too exactly). The main clinical applications of radiomics are radiogenomics and clinical outcome prediction.
Now we will briefly explain the process of radiomics.
In this project, we will study images obtained from scans of the lungs. The images we will obtain from the scans are DICOM images. DICOM (Digital Imaging and Communications in Medicine) is the standard for the communication and management of medical imaging information and related data. It is used worldwide to store, exchange, and transmit medical images. It defines the formats for medical images that can be exchanged with the data and quality necessary for clinical use. DICOM incorporates standards for imaging modalities such as radiography, ultrasonography, computed tomography (CT), magnetic resonance imaging (MRI), and radiation therapy. The most common applications of this standard are the display, storage, printing, and transmission of images.
To obtain the features, we can use multiple techniques. Radiomic techniques can be divided into four categories: intensity-based metrics, texture-based analysis, shape-based measures, and transform-based metrics. We will briefly discuss these techniques now.
Intensity-based metrics are statistics calculated from pixel values (or volumetric pixels, called voxels). Additional information that could be obtained by analyzing the spatial relationships between voxels is not considered. To compute the statistics, we select a region, extract its voxel values, and analyze them, for example with a histogram. To quantify different aspects of the distribution we use measures of average, variation, shape, and diversity.
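As a toy sketch of the idea (illustrative only, not the exact definitions used by radiomics packages), the following Python snippet computes a few histogram-style statistics from the voxel values of a region, ignoring all spatial relationships:

```python
import numpy as np

def intensity_stats(voxels):
    """Histogram-style first-order statistics of a region of interest.

    `voxels` is a 1-D array of voxel values extracted from the region;
    spatial relationships between voxels are deliberately ignored.
    """
    v = np.asarray(voxels, dtype=float)
    mean = v.mean()                              # average
    sd = v.std()                                 # variation
    skew = ((v - mean) ** 3).mean() / sd ** 3    # shape of the distribution
    counts, _ = np.histogram(v, bins=8)
    p = counts[counts > 0] / v.size
    entropy = -(p * np.log2(p)).sum()            # diversity of intensities
    return {"mean": mean, "sd": sd, "skewness": skew, "entropy": entropy}

roi = np.array([0, 0, 1, 1, 2, 2, 3, 3], dtype=float)
stats = intensity_stats(roi)
print(stats)  # mean 1.5, skewness 0, entropy 2.0 for this symmetric toy ROI
```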
Texture-based analysis refers to the analysis of image patterns, such as intensity, shape, or colors. Mathematical formulas based on the spatial relationship of voxels are used to quantify these concepts.
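To make the spatial-relationship idea concrete, here is a minimal, illustrative Python implementation of a gray-level co-occurrence matrix (GLCM) for a single offset; real packages such as RIA or scikit-image compute this far more generally:

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=4):
    """Gray-level co-occurrence matrix for a single offset (dx, dy).

    Entry (i, j) counts how often gray level i has gray level j as its
    neighbour at the given offset; texture metrics are derived from it.
    """
    m = np.zeros((levels, levels), dtype=int)
    rows, cols = img.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                m[img[r, c], img[r2, c2]] += 1
    return m

img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
M = glcm(img)                        # horizontal neighbours, offset (1, 0)
P = M / M.sum()                      # normalised co-occurrence probabilities
i, j = np.indices(M.shape)
contrast = (P * (i - j) ** 2).sum()  # one classic GLCM texture statistic
print(M.sum(), round(contrast, 3))
```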
Shape-based measures refer to the study of different components of a structure. They can be divided into 1D metrics (measurements of the distance between two points, used to describe the magnitude of an abnormality), 2D metrics (calculated on cross-sectional planes and used to derive area-based parameters), and 3D metrics (which attempt to quantify different aspects of volumetric shape).
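A small sketch in Python of 1D, 2D, and 3D shape measures computed from a binary mask; the mask and the voxel spacing here are made up for illustration, while a real mask would come from the segmentation step:

```python
import numpy as np

# Hypothetical binary mask (True = inside the structure) with made-up voxel
# spacing; real masks would come from the segmentation step.
mask = np.zeros((10, 10, 10), dtype=bool)
mask[2:6, 2:6, 2:6] = True            # a 4 x 4 x 4 voxel cube
spacing = (1.0, 1.0, 2.0)             # voxel size in mm along each axis

# 3D metric: volume = number of voxels inside * volume of one voxel
voxel_volume = spacing[0] * spacing[1] * spacing[2]
volume_mm3 = mask.sum() * voxel_volume

# 2D metric: cross-sectional area on one plane
area_mm2 = mask[:, :, 3].sum() * spacing[0] * spacing[1]

# 1D metric: maximum extent along the first axis (a simple diameter surrogate)
idx = np.where(mask)
extent_mm = (idx[0].max() - idx[0].min() + 1) * spacing[0]

print(volume_mm3, area_mm2, extent_mm)  # 128.0 16.0 4.0
```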
Transform-based metrics refer to the transformation of images from spatial coordinates to what is called a frequency domain, without losing any information.
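For example, the discrete Fourier transform moves an image to the frequency domain, and the inverse transform recovers it exactly (up to floating-point error), illustrating that no information is lost:

```python
import numpy as np

# A small random "image" moved to the frequency domain with the discrete
# Fourier transform; the inverse transform recovers it, so no information
# is lost.
rng = np.random.default_rng(0)
img = rng.random((8, 8))

freq = np.fft.fft2(img)            # spatial domain -> frequency domain
back = np.fft.ifft2(freq).real     # frequency domain -> spatial domain

print(np.allclose(img, back))      # True
```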
We can obtain multiple types of features from images. Qualitative features are used to describe lesions, and quantitative features are extracted from images using computer programs that apply mathematical algorithms. Now, we will focus on quantitative features, which can be divided into different subgroups.
Shape features describe the shape of the traced region of interest and its geometric properties, like volume, maximum diameter along different orthogonal directions, maximum surface area, tumor compactness, etc.
First-order statistics features describe the distribution of individual voxel values without taking into account spatial relationships. Some properties we obtain are the mean, median, maximum, minimum values of the voxel intensities on the image, skewness, kurtosis, etc.
Second-order statistics features include the textural features. They are obtained by computing the statistical relationships between neighboring voxels.
Higher-order statistics features are obtained by statistical methods after applying filters or mathematical transforms to the images. Some examples are fractal analysis, Minkowski functionals or Laplacian transforms of Gaussian-filtered images.
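As an illustration of one such filtered transform, SciPy's `ndimage.gaussian_laplace` computes the Laplacian of a Gaussian-filtered image; on a toy image with a single bright voxel, the strongest (most negative) response lands on the bright spot, which is why this filter is used to highlight blob-like structures:

```python
import numpy as np
from scipy import ndimage

# A toy image with a single bright voxel; gaussian_laplace smooths with a
# Gaussian filter and then applies the Laplacian, highlighting blob-like
# structures (the response is most negative at the bright spot).
img = np.zeros((9, 9))
img[4, 4] = 1.0

log_img = ndimage.gaussian_laplace(img, sigma=1.0)

print(log_img.argmin())  # 40, i.e. position (4, 4) in the flattened array
```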
In this section we will display a few images with different extensions. But first of all, we will explain how we obtain the images.
We will be working with images that look like this:
To obtain these images, X-rays are emitted from an X-ray tube. The rays pass through the lungs and reach a detector; depending on the tissue they pass through, more or fewer X-rays are absorbed. The scanner uses these different levels of absorption to create the image. Several horizontal slices of the lung are acquired, and by combining these slices we can reconstruct the lung in three dimensions.
The R package oro.dicom is a collection of input/output functions for medical imaging data that conform to the Digital Imaging and Communications in Medicine (DICOM) standard. The R package RIA is another package that was developed to facilitate radiomic analysis of medical images.
Now we will display two images from the same patient (that is, two slices of a person's lung). To read the images we use the function readDICOMFile(path_to_the_image) from oro.dicom.
library(oro.dicom)
image1 <- readDICOMFile("/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00007637202177411956430/11.dcm")
image(t(image1$img), col=grey(0:64/64), axes=FALSE, xlab="", ylab="")
image2 <- readDICOMFile("/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00007637202177411956430/12.dcm")
image(t(image2$img), col=grey(0:64/64), axes=FALSE, xlab="", ylab="")
If we have NIfTI images, we can view them with the library RNifti. The function to read the image is readNifti(path_to_image).
library(RNifti)
image <- readNifti("/Users/andrealetaalfonso/Desktop/TFG/DICOMS/rstudio-export/023140000001/023140000001_INSP_STD_L1_ECLIPSE.nii.gz")
view(image, point = c(239, 239, 131))
We can also use the libraries oro.nifti and neurobase. To read the images, use neurobase::readnii(path_to_image). For example, an image and a mask:
library(oro.nifti)
# Image
image2 <- neurobase::readnii("/Users/andrealetaalfonso/Desktop/TFG/DICOMS/rstudio-export/023140000001/023140000001_INSP_STD_L1_ECLIPSE.nii.gz")
image(image2,z=100,plot.type="single")
# Mask
mask <- neurobase::readnii("/Users/andrealetaalfonso/Desktop/TFG/DICOMS/rstudio-export/023140000001/023140000001_INSP_STD_L1_ECLIPSE_mask.nii.gz")
image(mask,z=100,plot.type="single")
And to overlay the image and the mask, use the function oro.nifti::slice_overlay(the_image, the_mask, z):
oro.nifti::slice_overlay(image2, mask, z=100)
First we will compute the radiomic features of the images without masks.
To compute the radiomic features of the images using the library RIA, we first need to convert the DICOM images to RIA_image class. To do so, we will use the function load_dicom(path_to_the_file).
If we have other image formats, RIA provides functions such as load_nifti(), load_nrrd() and load_npy() to read the images.
First, we compute the radiomic features of 1 image without a mask.
To compute the first-order statistics we use the first_order() function.
Some of the metrics computed are the mean, median, energy, entropy, and so on. RIA can calculate 44 different first-order statistics using the first_order function.
DICOM <- RIA::load_dicom(filename="/Users/andrealetaalfonso/Desktop/TFG/images/img/Folder_images/13", skipFirst128=FALSE, DICM=FALSE, boffset = 128)
data <- first_order(RIA_data_in = DICOM) # first order statistics
first_order <- RIA:::list_to_df(data$stat_fo$orig) # list of first order statistics
name_characteristics <- rownames(first_order) # names of the first order statistics
library(tidyverse)
With RIA we can also compute the Gray level co-occurrence matrix (GLCM) calculations and statistics. GLCMs are second-order statistics. We would use the function glcm.
DICOM = glcm(RIA_data_in = DICOM, off_right = 1, off_down = 2, off_z = 2)
The function glrlm calculates Gray level run length matrix (GLRLM) matrices of 2D and 3D images. GLRLM metrics are higher-order statistics.
DICOM = glrlm(RIA_data_in = DICOM, off_right = 1, off_down = 0, off_z = 1)
To compute the geometry-based statistics, use the function geometry. RIA calculates 9 conventional geometrical parameters and 3 fractal dimensions.
DICOM = geometry(RIA_data_in = DICOM, use_orig = TRUE, calc_sub = FALSE)
RIA:::list_to_df(DICOM$stat_geometry$orig)
And finally, if we want to compute all the radiomic features at once, RIA has the radiomics_all function to do it.
DICOM <- radiomics_all(DICOM, bins_in = c(8, 16, 32), equal_prob = "both")
Now we will compute the first order characteristics of 10 images (images 10 through 19) from the same patient (different slices) without a mask. We could do the following loop:
first_order <- c()
res <- c()
first_image <- 10 # first image to study
last_image <- 19 # last image to study
for (i in first_image:last_image) { # loop through all the images
  path <- file.path("/Users", "andrealetaalfonso", "Desktop", "TFG", "images", "img", "Folder_images", i, fsep="/") # path of every image
  DICOM <- RIA::load_dicom(filename=path, skipFirst128=FALSE, DICM=FALSE, boffset=128) # load DICOM image
  DICOM <- first_order(RIA_data_in = DICOM, use_type = "single") # compute first-order statistics
  first_order[i] <- RIA:::list_to_df(DICOM$stat_fo$orig)
  res[i-first_image+1] <- first_order[i] # data.frame counts from 1 to 10 (not first_image to last_image)
}
Finally, we combine all the results into a data.frame called ‘results’.
results <- do.call(cbind.data.frame, res)
colnames(results) <- seq(1, last_image-first_image+1)
rownames(results) <- name_characteristics
#results: results in a data.frame
print(as_tibble(results))
## # A tibble: 44 × 10
## `1` `2` `3` `4` `5` `6` `7` `8`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -338. -336. -325. -3.26e+2 -3.13e+2 -2.80e+2 -2.70e+2 -2.59e+2
## 2 -205 -196 -195 -1.87e+2 -1.85e+2 -1.44e+2 -1.25e+2 -1.21e+2
## 3 -1023 -1023 -1023 -1.02e+3 -1.02e+3 -1.02e+3 -1.02e+3 -1.02e+3
## 4 2757. 2598. 2672. 2.49e+3 2.53e+3 2.12e+3 1.86e+3 1.97e+3
## 5 0.0112 0.0114 0.0128 1.28e-2 1.52e-2 2.38e-2 2.92e-2 3.67e-2
## 6 399. 401. 415. 4.21e+2 4.33e+2 4.72e+2 4.76e+2 4.87e+2
## 7 -2143. -4731. -4591. -6.69e+3 6.31e+3 6.01e+3 2.20e+3 1.82e+3
## 8 -349. -345. -336. -3.34e+2 -3.22e+2 -2.85e+2 -2.72e+2 -2.65e+2
## 9 -350. -344. -337. -3.33e+2 -3.21e+2 -2.82e+2 -2.67e+2 -2.62e+2
## 10 -346. -338. -331. -3.26e+2 -3.14e+2 -2.70e+2 -2.51e+2 -2.48e+2
## # … with 34 more rows, and 2 more variables: `9` <dbl>, `10` <dbl>
As we explained in the introduction, segmentation refers to delineating the areas of interest in the image in terms of pixels or voxels, in our case, the lung. Lung segmentation refers to the computer-based process of identifying the boundaries of lungs from surrounding thoracic tissue on the scan images. Once we have the segmented images, we can start our analysis.
Now we will discuss different software to do lung segmentation.
We will use the R package lungct, which implements an image processing pipeline for computed tomography (CT) scans of the lungs.
First, we load the libraries needed to do the segmentation. We will use the library dcm2niir to convert DCM files to NIFTI.
library(lungct)
library(dcm2niir)
library(ANTsRCore)
Now we have to convert the data from DCM to NII. NIfTI (Neuroimaging Informatics Technology Initiative) is a data format for the storage of functional Magnetic Resonance Imaging (fMRI) data and other medical images.
res2 <- dcm2niir::dcm2nii(basedir="/Users/andrealetaalfonso/Desktop/TFG/DICOMS/id_1")
checked2 <- check_dcm2nii(res2) # Manipulating the output
Next we plot some images.
image2 <- antsImageRead(checked2)
plot(image2)
We create the masks.
mask2 <- segment_lung(image2)
And finally, we plot some of the masks.
plot(mask2$lung_mask)
Now we will create multiple .txt files with the results of the radiomic features of different patients, using a mask.
The IDs of the patients are the following.
patient1 <- "ID00007637202177411956430"
patient2 <- "ID00009637202177434476278"
patient3 <- "ID00010637202177584971671"
patient4 <- "ID00011637202177653955184"
patient5 <- "ID00012637202177665765362"
patient6 <- "ID00014637202177757139317"
patient7 <- "ID00015637202177877247924"
patient8 <- "ID00019637202178323708467"
patient9 <- "ID00020637202178344345685"
patient10 <- "ID00023637202179104603099"
vector_patients <- c(patient1, patient2, patient3, patient4, patient5, patient6, patient7, patient8, patient9, patient10) # list of patients ID
for (patient in vector_patients) {
  first_order <- c()
  path <- file.path("/Users", "andrealetaalfonso", "Desktop", "TFG", "images", "Kaggle", patient, fsep="/") # path of every patient
  number_files <- length(list.files(path)) # number of folders (i.e. images) per patient
  for (i in 1:number_files) { # loop through all the images
    path <- file.path("/Users", "andrealetaalfonso", "Desktop", "TFG", "images", "Kaggle", patient, i, fsep="/") # path of every image
    res <- dcm2niir::dcm2nii(basedir=path) # convert the data from DCM to NII
    checked <- check_dcm2nii(res) # manipulate the output
    image <- antsImageRead(checked)
    mask <- segment_lung(image) # lung segmentation
    DICOM <- RIA::load_dicom(filename=path, mask_filename=path, skipFirst128=FALSE, DICM=FALSE, boffset=128) # load DICOM image + mask
    DICOM <- first_order(RIA_data_in = DICOM, use_type = "single") # compute first-order statistics
    first_order[i] <- RIA:::list_to_df(DICOM$stat_fo$orig)
  }
  results <- do.call(cbind.data.frame, first_order)
  colnames(results) <- seq(1, number_files)
  rownames(results) <- name_characteristics
  write.table(results, row.names = TRUE, col.names = NA, file = paste0(patient, ".txt"), sep = "\t") # table with the results
}
Code adapted from: https://www.kaggle.com/arnavkj95/candidate-generation-and-luna16-preprocessing
Now we will do the segmentation but instead of using R, we will be using Python.
First we need to import some Python libraries.
import numpy as np # pip3 install numpy
import pandas as pd # pip3 install pandas
# pip3 install matplotlib
# pip3 install scipy
import skimage # pip3 install scikit-image
import os
from skimage.morphology import ball, disk, dilation, binary_erosion, remove_small_objects, erosion, closing, reconstruction, binary_closing
from skimage.measure import label,regionprops, perimeter
from skimage.morphology import binary_dilation, binary_opening
from skimage.filters import roberts, sobel
from skimage import measure, feature
from skimage.segmentation import clear_border
from skimage import data
from scipy import ndimage as ndi
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
import scipy.misc
import pydicom # pip3 install pydicom
Each scan consists of multiple slices. We have all the DICOM images from the scan in one folder, and in path_images we indicate the path to that folder.
from subprocess import check_output
path_images = "/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00184637202242062969203/"
# You can check that everything is loading correctly with: print(check_output(["ls", path_images]).decode("utf8"))
To read the images, we will use the function pydicom.read_file(). Then we will replace the intensity values of -2000 with 0; these pixels are the ones located outside the scanner bounds.
lung = pydicom.read_file("/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00184637202242062969203/20.dcm")
slice = lung.pixel_array
slice[slice == -2000] = 0  # pixels outside the scanner bounds
plt.axis('off')
plt.imshow(slice, cmap=plt.cm.gray)
plt.show()
We create a function file_is_hidden() so that we read only the .dcm files and skip hidden files in the folder (files that are not normally visible in the file browser).
if os.name == 'nt':
    import win32api, win32con

def file_is_hidden(p):
    if os.name == 'nt':
        attribute = win32api.GetFileAttributes(p)
        return attribute & (win32con.FILE_ATTRIBUTE_HIDDEN | win32con.FILE_ATTRIBUTE_SYSTEM)
    else:
        return p.startswith('.')  # Linux / macOS

file_list = [f for f in os.listdir(path_images) if not file_is_hidden(f)]
Now we will read all the images from a folder with a function named read_ct_scan(folder_name).
def read_ct_scan(folder_name):
    # Read the slices from the DICOM files
    slices = [pydicom.read_file(folder_name + filename) for filename in os.listdir(folder_name) if not file_is_hidden(filename)]
    # Sort the DICOM slices in their respective order
    slices.sort(key=lambda x: int(x.InstanceNumber))
    # Get the pixel values for all the slices and stack them into a 3-D array
    slices = np.stack([s.pixel_array for s in slices])
    slices[slices == -2000] = 0
    return slices
ct_scan = read_ct_scan(path_images)
Plot some of the images from a folder.
def plot_ct_scan(scan):
    f, plots = plt.subplots(int(scan.shape[0] / 20) + 1, 4, figsize=(25, 25))
    for i in range(0, scan.shape[0], 5):
        plots[int(i / 20), int((i % 20) / 5)].axis('off')
        plots[int(i / 20), int((i % 20) / 5)].imshow(scan[i], cmap=plt.cm.gray)
plot_ct_scan(ct_scan)
plt.show()